class: center, middle, inverse, title-slide .title[ # Automatic Sampling and Analysis of YouTube Data ] .subtitle[ ## Introduction ] .author[ ### Johannes Breuer, Annika Deubel, & M. Rohangis Mohseni ] .date[ ### February 14th, 2023 ] --- layout: true --- ## Goals of this course After this course you should be able to... - automatically collect *YouTube* data - process/clean it - do some basic (exploratory) analyses of user comments --- ## About us **Johannes Breuer** .small[ - Senior researcher in the team *Data Augmentation*, department [*Survey Data Curation*](https://www.gesis.org/en/institute/departments/survey-data-curation) at *GESIS* - digital trace data for social science research - data linking (surveys + digital trace data) - (Co-)leader of the team *Research Data & Methods* at the [*Center for Advanced Internet Studies* (CAIS)](https://www.cais-research.de/) - Ph.D. in Psychology, University of Cologne - Research interests - Use and effects of digital media - Computational methods - Data management - Open science [johannes.breuer@gesis.org](mailto:johannes.breuer@gesis.org), [@MattEagle09](https://twitter.com/MattEagle09), [personal website](https://www.johannesbreuer.com/) ] --- ## About us **Annika Deubel** .small[ - M.Sc. in Applied Cognitive and Media Sciences (University of Duisburg-Essen) - Ph.D. candidate at the University of Duisburg-Essen - Researcher in the team *Research Data and Methods* at the *Center for Advanced Internet Studies* (CAIS) - Main area: health communication and information on social media - Other research interests: - Data and Algorithm Literacy - Computational methods [annika.deubel@cais-research.de](mailto:annika.deubel@cais-research.de), [@anndeub](https://twitter.com/anndeub) ] --- ## About us **M. Rohangis Mohseni** - Postdoctoral researcher (Media Psychology) at TU Ilmenau - Ph.D. in Psychology, University Osnabrueck - Ongoing habilitation "sexist online hate speech" 👿 - Other research interests - Electronic media effects - Moral behavior [rohangis.mohseni@tu-ilmenau.de](mailto:rohangis.mohseni@tu-ilmenau.de), [@romohseni](https://twitter.com/romohseni) --- ## About you - What's your name? - Where do you work? - What is your experience with `R`? - Why/how do you want to use *YouTube* for your research? --- ## Prerequisites for this course - Working version of `R` >= 4.0.0 and a recent version of [RStudio](https://rstudio.com/products/rstudio/download/#download) - Some basic knowledge of `R` - Interest in working with *YouTube* data --- ## Workshop Structure & Materials - The workshop consists of a combination of lectures and hands-on exercises - Slides and other materials are available at [https://github.com/jobreu/youtube-workshop-gesis-2023](https://github.com/jobreu/youtube-workshop-gesis-2023) We also put the PDF versions of the slides and some other materials on the [GESIS Ilias](https://ilias.gesis.org/) repository for this course. --- ## Online format - If possible, we invite you to turn on your camera - Feel free to ask questions anytime - If you have an immediate question during the lecture parts, please send it via text chat, publicly or privately (ideally to a person who is currently not presenting) - If you have a question that is not urgent and might be interesting for everybody, you can also use audio (& video) to ask it at the end of a lecture part or during the exercises (please use the use the "raise hand" function in *Zoom* for this) - We would kindly ask you to mute your microphones when you are not asking (or answering) a question --- ## Preliminaries: Base `R` vs. `tidyverse` In this course, we will use a mixture of base `R` and `tidyverse` code as Johannes prefers the `tidyverse`, and Annika and Ro use both. .small[ ICYC, here are some opinions [for](http://varianceexplained.org/r/teach-tidyverse/) and [against](https://blog.ephorie.de/why-i-dont-use-the-tidyverse) using/teaching the `tidyverse`. ] --- ## The `tidyverse` If you've never seen `tidyverse` code, the most important thing to know is the `%>%` [(pipe) operator](https://magrittr.tidyverse.org/reference/pipe.html). Put briefly, the pipe operator takes an object (which can be the result of a previous function) and pipes it (by default) as the first argument into the next function. This means that `function(arg1 = x)` is equivalent to `x %>% function()`. It may also be worthwhile to know/remember that `tidyverse` functions normally produce [`tibbles`](https://tibble.tidyverse.org/) which are a special type of dataframe (and most `tidyverse` functions also expect dataframes/tibbles as input to their first argument). --- ## The `tidyverse` If you want a short primer (or need a quick refresher) on the `tidyverse`, you can check out the blog [post by Dominic Royé](https://dominicroye.github.io/en/2020/a-very-short-introduction-to-tidyverse/). For a more in-depth exploration of the `tidyverse`, you can, e.g., have a look at the [workshop by Olivier Gimenez](https://oliviergimenez.github.io/intro_tidyverse/#1). And the book [*R for Data Science*](https://r4ds.had.co.nz/) by Hadley Wickham and Garrett Grolemund (which is available for free online) provides a very comprehensive introduction to the `tidyverse`. --- ## Preliminaries: What's in a name? Another thing you might notice when looking at our code is that we love 🐍 as much as 🐫. <img src="data:image/png;base64,#https://bookdown.org/joone/ComputationalMethods/img/horst/coding_cases.png" width="50%" style="display: block; margin: auto;" /> .center[ <small><small>Artwork by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)</small></small> ] --- ## Course schedule .center[**Tuesday, February 14th, 2023**] <table> <thead> <tr> <th style="text-align:center;"> When? </th> <th style="text-align:center;"> What? </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> 09:00 - 10:00 </td> <td style="text-align:center;"> Introduction </td> </tr> <tr> <td style="text-align:center;"> 10:00 - 11:00 </td> <td style="text-align:center;"> The YouTube API </td> </tr> <tr> <td style="text-align:center;"> 11:00 - 11:15 </td> <td style="text-align:center;"> <i>Coffee Break</i> </td> </tr> <tr> <td style="text-align:center;"> 11:15 - 12:15 </td> <td style="text-align:center;"> Tools for collecting YouTube data </td> </tr> <tr> <td style="text-align:center;"> 12:15 - 13:15 </td> <td style="text-align:center;"> <i>Lunch Break</i> </td> </tr> <tr> <td style="text-align:center;"> 13:15 - 14:45 </td> <td style="text-align:center;"> Collecting YouTube data with R </td> </tr> <tr> <td style="text-align:center;"> 14:45 - 15:00 </td> <td style="text-align:center;"> <i>Coffee Break</i> </td> </tr> <tr> <td style="text-align:center;"> 15:00 - 16:30 </td> <td style="text-align:center;"> Processing and cleaning user comments </td> </tr> </tbody> </table> --- ## Course schedule .center[**Wednesday, February 15th, 2023**] <table> <thead> <tr> <th style="text-align:center;"> When? </th> <th style="text-align:center;"> What? </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> 09:00 - 10:30 </td> <td style="text-align:center;"> Basic text analysis of user comments </td> </tr> <tr> <td style="text-align:center;"> 10:30 - 10:45 </td> <td style="text-align:center;"> <i>Coffee Break</i> </td> </tr> <tr> <td style="text-align:center;"> 10:45 - 12:15 </td> <td style="text-align:center;"> Sentiment analysis of user comments </td> </tr> <tr> <td style="text-align:center;"> 12:15 - 13:15 </td> <td style="text-align:center;"> <i>Lunch Break</i> </td> </tr> <tr> <td style="text-align:center;"> 13:15 - 14:45 </td> <td style="text-align:center;"> Excursus: Retrieving video subtitles </td> </tr> <tr> <td style="text-align:center;"> 14:45 - 15:00 </td> <td style="text-align:center;"> <i>Coffee Break</i> </td> </tr> <tr> <td style="text-align:center;"> 15:00 - 16:30 </td> <td style="text-align:center;"> Practice session, questions, and outlook </td> </tr> </tbody> </table> --- ## Why is *YouTube* relevant? - Important online video platform<br /><small>([Alexa Traffic Ranks, 2019](https://www.alexa.com/topsites); [Konijn, Veldhuis, & Plaisier, 2013](https://doi.org/10.1089/cyber.2012.0357))</small> - Esp. popular among adolescents who use it to, e.g., watch movies & shows, listen to music, and retrieve information<br /><small>([Feierabend, Plankenhorn, & Rathgeb, 2016](https://www.mpfs.de/studien/kim-studie/2016/))</small> - For adolescents, *YouTube* partly replaces TV<br /><small>([Defy Media, 2017](http://www.newsroom-publicismedia.fr/wp-content/uploads/2017/06/Defi-media-acumen-Youth_Video_Diet-mai-2017.pdf))</small> - YouTubers can be social media stars<br /><small>([Budzinski & Gaenssle, 2018](https://doi.org/10.1080/08997764.2020.1849228))</small> --- ## Why is *YouTube* data interesting for research? - Content producers and users generate huge amounts of data - These data can be useful for research on media content, communicators, and user interaction - The data are publicly available and relatively easy to retrieve via the *YouTube* API - For some further reasons and examples, see [Arthurs et al., 2019](https://doi.org/10.1177/1354856517737222); [Baertl, 2018](https://doi.org/10.1177/1354856517736979) --- ## Research Examples - Audience - Usage of YouTube<br /><small>([Defy Media, 2017](http://www.newsroom-publicismedia.fr/wp-content/uploads/2017/06/Defi-media-acumen-Youth_Video_Diet-mai-2017.pdf))</small> - Experiences with YouTube<br /><small>([Defy Media, 2017](http://www.newsroom-publicismedia.fr/wp-content/uploads/2017/06/Defi-media-acumen-Youth_Video_Diet-mai-2017.pdf); [Lange, 2007](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.170.3808&rep=rep1&type=pdf); [Moor et al., 2010](https://doi.org/10.1016/j.chb.2010.05.023); [Oksanen, et al. 2014](https://doi.org/10.1108/S1537-466120140000018021); [Szostak, 2013](https://journals.mcmaster.ca/mjc/article/view/280); [Yang et al., 2010](https://doi.org/10.1089/cyber.2009.0105))</small> - Video consumption<br /><small>([Montes-Vozmediano et al., 2018](https://doi.org/10.3916/C54-2018-06); [Tucker-McLaughlin, 2013](https://scholar.google.de/scholar?cluster=6669321353953252964))</small> - Radicalization<br /><small>([Albadi et al., 2022](https://arxiv.org/abs/2207.00111); [Ribeiro et al., 2020](https://doi.org/10.1145/3351095.3372879))</small> - Community formation<br /><small>([Kaiser & Rauchfleisch, 2020](https://doi.org/10.1177/2056305120969914))</small> --- ## Research Examples - Content - Incivility / Hate Speech in comments<br /><small>([Döring & Mohseni, 2019a](https://doi.org/10.1080/14680777.2018.1467945), [2019b](https://doi.org/10.1080/08824096.2019.1634533), [2020](https://doi.org/10.5771/2192-4007-2020-1-62); [Obadimu et al, 2019](https://doi.org/10.1007/978-3-030-21741-9_22); [Spörlein & Schlueter, 2021](https://doi.org/10.1093/esr/jcaa053); [Wotanis & McMillan, 2014](https://doi.org/10.1080/14680777.2014.882373))</small> - Commenter attributes<br /><small>([Literat & Kligler-Vilenchik, 2021](https://doi.org/10.1177/20563051211008821); [Röchert et al., 2020](https://doi.org/10.5117/CCR2020.1.004.ROCH); [Thelwall & Mas-Bleda, 2018](https://doi.org/10.1108/AJIM-09-2017-0204))</small> - Comment characteristics<br /><small>([Thelwall, 2018](https://doi.org/10.1080/13645579.2017.1381821); [Thelwall et al., 2012](https://doi.org/10.1002/asi.21679))</small> - Video content<br /><small>([Kohler & Dietrich, 2021](https://doi.org/10.3389/fcomm.2021.581302); [Utz & Wolfers, 2020](https://doi.org/10.1080/1369118X.2020.1804984))</small> --- ## Research Examples - Communicator - Video production<br /><small>([Utz & Wolfers, 2020](https://doi.org/10.1080/1369118X.2020.1804984))</small> - Extremism / Ideology<br /><small>([Rauchfleisch & Kaiser, 2020](https://doi.org/10.1080/08838151.2020.1799690), [2021](https://doi.org/10.2139/ssrn.3867818); [Dinkov et al., 2019](https://arxiv.org/abs/1910.08948); [Ribeiro et al., 2020](https://doi.org/10.1145/3351095.3372879))</small> - Gender / Diversity<br /><small>([Chen et al, 2021](https://doi.org/10.1177/14614448211034846); [Wegener et al., 2020](https://doi.org/10.5204/mcj.2728); [Thelwall & Mas-Bleda, 2018](https://doi.org/10.1108/AJIM-09-2017-0204))</small> - Economical aspects<br /><small>([Budzinski & Gaenssle, 2018](https://doi.org/10.1080/08997764.2020.1849228))</small> - Channel hierarchy / Ranking<br /><small>([Rieder et al., 2018](https://doi.org/10.1177/1354856517736982); [Rieder et al., 2020](https://doi.org/10.5210/fm.v25i8.10667))</small> --- class: center, middle # Any questions so far?